FOREWORD


Army  training  developers  need  tools  to  aid  in  the  design, 
acquisition,  and  use  of  simulation-  and  computer-based  programs 
of  instruction  for  weapon  operation  and  maintenance.  One 
critical  need  is  a  job  aid  for  the  design  and  evaluation  of 
training  devices  during  all  stages  in  the  weapon  acquisition 
cycle. 

This series of three reports describes one approach to such
aiding: a hybrid of decision analysis and mathematical modeling.
The approach provides numerical estimates of device effectiveness
that are based on expert ratings of trainee and task
characteristics, functional and physical similarity between
the proposed device and the operational equipment, and the
instructional characteristics of the device. It is an analytic,
computer-based technique, implemented as a menu-driven system,
that can be used at any stage of training device design.

The  product  of  this  research  can  help  training  device 
procurers  such  as  PM-TRADE  and  training  developers  in  TRADOC 
make  better  documented  decisions  about  training  device  design. 
Forecasting Device Effectiveness: III. Analytic
Assessment of DEFT

EXECUTIVE SUMMARY

Requirement: 

To analytically address the numeric and scalar properties
of the Device Effectiveness Forecasting Technique
(DEFT); to conduct an examination of interrater agreement
by analyzing three training devices.

Procedure: 

Several analytic procedures were conducted to address
various aspects of the scalar properties of DEFT. These
procedures included Monte Carlo simulations to assess the
interpretation of DEFT output, the sensitivity of DEFT
parameters, the comparison of outputs, and stability, together
with an empirical study of interrater agreement.

Findings: 

Results indicated that it would be necessary to incorporate
assumptions regarding expected distributions of input
variables in order to meaningfully interpret DEFT output.
Also, the Monte Carlo analyses demonstrated the sensitivity
of DEFT output scores to variations in inputs, and
assessed the effects of various assumptions regarding
measurement error on output scores.

The  interrater  agreement  issue  was  addressed  by  having 
several  raters  apply  DEFT  to  three  actual  training  devices. 
Results  indicated  a  high  degree  of  consistency  among  raters 
for  all  devices  and  for  all  levels  of  DEFT. 

Utilization of Findings:

These  findings  indicate  that,  with  few  modifications, 
DEFT  can  be  used  effectively  and  reliably  to  analytically 
evaluate  training  device-based  training  systems. 
Introduction


This report is submitted in partial fulfillment of
Contract 903-82-C-0414 between the Army Research
Institute (ARI) and the American Institutes for Research
(AIR). It is part of a programmatic effort to develop and
analytically evaluate a model designed to forecast training
device effectiveness. Specifically, this report describes
the analytic evaluation phase of the effort.

Previous reports in this series have discussed issues
related to the evaluation of a training system (Rose &
Wheaton, 1984a), and presented an analytic model (Rose &
Wheaton, 1984b). This model, named the Device
Effectiveness Forecasting Technique (DEFT), incorporates
numerous ratings and judgments regarding components of the
training situation and the operational performance requirement
and generates forecasts of training device effectiveness.
In lieu of empirical tests, Rose and Wheaton (1984a)
outlined several analytic methods that could be employed to
assess the adequacy of such a model.

Decisions  and  Designs,  Inc.  (DDI)  and  AIR  employed 
five  such  methods  in  the  evaluation  of  DEFT: 


•  Interpretation of output--what sorts of results can
be expected from DEFT?

•  Sensitivity analysis--what is the impact on DEFT
output of varying input parameter values?

•  Comparison of outputs--what do differences in
scores received by various devices mean?

•  Stability--what is the impact of disagreement between
raters on component scores?

•  Interrater agreement--applying DEFT to three training
devices, to what extent do raters agree for
each of the various ratings and judgments?

The first four of these questions were addressed using
Monte Carlo analysis. The general approach was to simulate
applications of the DEFT model by generating 5,000 random
values (within the appropriate ranges) for each of the
various DEFT inputs (Performance Deficit, Difficulty,
etc.)* and combining them according to the DEFT formulae,
yielding 5,000 DEFT output scores. For the "interpretation
of output" issue, this analysis, repeated under different
conditions, constituted the entire computational activity.
Sensitivity analysis was performed using a variation on the
basic analysis: Random values were generated for all but
one of the input parameters; to examine the sensitivity of
the output score to the value of the remaining input
parameter, this parameter was stepped through its range of
values in an orderly fashion, and output scores were
computed for each of the values that it assumed. For
"comparison of outputs," the basic analysis was performed twice
to obtain two 5,000-element vectors of output scores. One
vector was subtracted from the other, resulting in a vector
of differences. A frequency distribution computed for this
vector allows significance testing of difference values.
Finally, the impact of less than perfect interrater
stability was explored by simulating "measurement error"
and scale bias and examining their effects on the DEFT
output.

*For details regarding the components of DEFT, combination
rules, output variables, and rating procedures, see Rose &
Wheaton, 1984b.

The basic procedure for assessing interrater agreement
was to have six raters apply DEFT to three training
devices. Model outputs were compared using various statistical
techniques. This document presents the results of
the five sets of analyses. First, we will present the
general technical approach to the Monte Carlo analyses,
followed by those results. We will then present the
details of the interrater agreement study.


2.  Monte  Carlo  Analyses 


General  Technical  Approach  to  the  Monte  Carlo  Analysis 

As we mentioned in the introduction, Monte Carlo
analysis was used to simulate applications of DEFT in order
to address each of the four basic questions (interpretation,
sensitivity, comparison of outputs, and stability).

Eight input variables were used in these analyses:*

•  Performance  Deficit  (PD) 

•  Difficulty  (D) 

•  Training  Acquisition  Efficiency  (AE) 

•  Residual  Performance  Deficit  (RPD) 

•  Residual  Learning  Difficulty  (RLD) 

•  Physical  Similarity  (PS) 

•  Functional  Similarity  (FS) 

•  Transfer  Efficiency  (TT) 

*  Abbreviations  are  those  used  in  report  II. 


These variables are obtained in different ways for each of
the three levels of DEFT. However, since these different
methods all result in equivalent scales (e.g., "Performance
Deficit" has a range of 0-100 for all three DEFT levels),
it was decided to use these variables in the Monte Carlo
analyses.

Since the distribution of DEFT outputs (the basic
product of each analysis) depends on the distribution of
the inputs, selection of input distributions was key.
Because DEFT is a new tool that has not been applied to the
evaluation of a large number of training devices, no
empirical distributions of inputs currently exist.
Therefore, it was necessary to use artificial input
distributions. The analysts working on this task selected the
uniform distribution (i.e., all input values have the same
probability of being selected) as the standard for input to
the Monte Carlo analyses. This represents an extremely
conservative approach; it was selected to provide a "worst
case" baseline for comparisons with other sets of
assumptions.

In addition to selecting a distributional form for input
to the analyses, it was necessary to decide on the
number of trials or simulated model applications for each
analysis. The selection criterion used for the number of
trials was the degree of convergence of (1) a distribution
of data points generated randomly from an underlying
uniform distribution with (2) the theoretical uniform
distribution. Convergence was examined for numbers of trials
ranging from 1,000 to 9,000. The number 5,000 was chosen,
finally, because it is cost-effective for this application;
convergence is almost as good for 5,000 trials as for 9,000
trials, and substantially less computing power is required.
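
The trial-count comparison described above can be illustrated with a short sketch. The report does not say which convergence measure was used; the maximum gap between the empirical and theoretical cumulative distributions (a Kolmogorov-Smirnov style distance) is an assumption made here for illustration, and the function names are hypothetical.

```python
import random

def ks_distance_uniform(sample, low, high):
    """Largest gap between the sample's empirical CDF and the uniform CDF on [low, high]."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        theoretical = (x - low) / (high - low)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        worst = max(worst, abs((i + 1) / n - theoretical), abs(i / n - theoretical))
    return worst

def convergence_by_trials(trial_counts, low=1.0, high=100.0, seed=0):
    """Distance from the theoretical uniform distribution for each candidate trial count."""
    rng = random.Random(seed)
    return {
        n: ks_distance_uniform([rng.uniform(low, high) for _ in range(n)], low, high)
        for n in trial_counts
    }

# Candidate trial counts from 1,000 to 9,000, as in the report.
distances = convergence_by_trials(range(1000, 10000, 1000))
for n, dist in distances.items():
    print(n, round(dist, 4))
```

Under this measure the distance shrinks roughly with the square root of the number of trials, so the gain from 5,000 to 9,000 trials is modest relative to the added computation.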

Thus, each Monte Carlo analysis of DEFT output simulates
5,000 random applications of the DEFT model. This
basic analysis was performed under a variety of conditions
that depended upon the question to be answered. Tabular
and, where appropriate, graphic presentations of results
appear in the following sections.
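
A minimal sketch of this basic analysis follows. The DEFT combination rules are documented in Rose & Wheaton (1984b) and are not reproduced in this report, so `toy_score` below is a hypothetical stand-in (a 0-100 rating divided by a rescaled efficiency rating), not the DEFT formulae; only the trial count, the uniform inputs, and the reported statistics follow the text.

```python
import random
import statistics

N_TRIALS = 5000  # number of simulated model applications, as in the report

def toy_score(problem, efficiency):
    """Hypothetical stand-in for a DEFT combination rule (NOT the real formula)."""
    return problem / (efficiency / 100.0)

def run_monte_carlo(n_trials=N_TRIALS, seed=0):
    """Draw uniform random inputs and combine them, one score per simulated application."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        problem = rng.uniform(0.0, 100.0)     # ordinary input, range 0-100
        efficiency = rng.uniform(1.0, 100.0)  # denominator input, range 1-100
        scores.append(toy_score(problem, efficiency))
    return scores

def describe(values):
    """The descriptive statistics reported in the tables."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),
        "std_dev": statistics.stdev(values),
        "minimum": min(values),
        "maximum": max(values),
    }

scores = run_monte_carlo()
summary = describe(scores)
print(summary)
```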

Interpretation  of  Output 

The objective of this first set of analyses was to explore
the distributional characteristics of the DEFT output.
This was done under five different conditions, three
using uniform distributions, and two using truncated normal
distributions. The conditions were:


1)  Uniform input distributions; denominator input
variables (i.e., acquisition and transfer efficiency
measures [see Rose & Wheaton, 1984b,
Chapter 6]) range from one to 100; all others
range from zero to 100. Inputs combined using
initial DEFT model.

2)  Uniform input distributions; all input variables
range from one to 100. Inputs combined using initial
DEFT model.

3)  Uniform input distributions; all inputs range from
one to 100. Square root taken of denominator
(efficiency) variables (e.g., AE = √(R/100) instead of
AE = R/100); otherwise, combination identical to
initial DEFT model.

4)  Input distributions truncated normal. Inputs combined
using initial DEFT model.

5)  Input distributions truncated normal. Square root
taken of efficiency variables. Otherwise, combination
identical to initial DEFT model.

Tables 1 through 3 summarize results for intermediate
and output variables under Conditions 1, 2, and 3. The
variable abbreviations used in these tables are defined
after Table 3.


Table 1. CONDITION 1 RESULTS--UNIFORM INPUT;
INITIAL RANGES AND COMBINATIONS

DESCRIPTIVE STATISTICS FOR MODEL DEFT
5000 TRIALS

VARIABLE         MEAN    VARIANCE   STD DEV   MINIMUM    MAXIMUM
TP              24.87      491.21     22.16       .00      99.00
ACQ (A)        131.36   177317.58    421.09       .00    8722.00
AD              16.76      555.15     23.56       .00      99.00
TRP             41.71     1039.74     32.25       .00     168.22
TRANS (T)      217.69   390344.31    624.78       .00   11967.00
TOTAL (A+T)    349.04   572816.19    756.85       .00   12268.29

Table 2. CONDITION 2 RESULTS--UNIFORM INPUT;
ALL RANGES 1-100; INITIAL COMBINATION

DESCRIPTIVE STATISTICS FOR MODEL DEFT
5000 TRIALS

VARIABLE         MEAN    VARIANCE   STD DEV   MINIMUM    MAXIMUM
TP              25.12      486.31     22.05       .04     100.00
ACQ (A)        131.75   150900.78    388.46       .06    8700.00
AD              16.99      557.64     23.61       .00      97.00
TRP             42.59     1069.29     32.70       .06     188.00
TRANS (T)      211.42   398557.33    631.31       .07   11450.00
TOTAL (A+T)    343.17   555211.20    745.12      1.63   11466.21

Table 3. CONDITION 3 RESULTS--UNIFORM INPUT;
ALL RANGES 1-100; SQUARE ROOT TRANSFORMATION

DESCRIPTIVE STATISTICS FOR MODEL DEFT
5000 TRIALS

VARIABLE         MEAN    VARIANCE   STD DEV   MINIMUM    MAXIMUM
TP              25.37      488.26     22.10       .03      99.00
ACQ (A)         47.23     3751.51     61.25       .06     872.20
AD              16.58      544.03     23.32       .00      98.00
TRP             42.04     1026.01     32.03       .08     160.00
TRANS (T)       78.33     8156.73     90.31       .09    1201.60
TOTAL (A+T)    125.56    11770.24    108.49       .96    1284.90

In Tables 1 through 3:

TP = Training Problem
ACQ (A) = Total Acquisition Score
AD = Additional Deficit
TRP = Transfer Problem
TRANS (T) = Total Transfer Score
TOTAL (A+T) = Total Score

The most striking features of these results are the
high variances displayed in Conditions 1 and 2; the output
distributions are extremely diffuse given uniform input
distributions. In Condition 3, the output distributions
are substantially tighter because of the square root
transformation in the denominators (the transformation makes the
denominator larger, narrowing the range).
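
The effect of the transformation on the denominator can be checked directly. The two efficiency expressions below are the ones given in the list of conditions (AE = R/100 and AE = √(R/100)); the numerator of 50 is a hypothetical intermediate score chosen only for illustration.

```python
import math

def efficiency_initial(r):
    """Initial model: efficiency rating R (1-100) rescaled to (0, 1]."""
    return r / 100.0

def efficiency_sqrt(r):
    """Condition 3 variant: square root of the rescaled rating."""
    return math.sqrt(r / 100.0)

numerator = 50.0  # hypothetical intermediate score being divided by efficiency

# For each rating, compare the score under the initial and transformed denominators.
rows = [
    (r, numerator / efficiency_initial(r), numerator / efficiency_sqrt(r))
    for r in (1, 25, 100)
]
for r, initial, transformed in rows:
    print(r, initial, transformed)
```

At the lowest rating the initial denominator inflates the numerator a hundredfold, while the transformed denominator inflates it only tenfold; at a rating of 100 the two agree. This is why the maxima and variances in Table 3 are so much smaller than in Tables 1 and 2.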

Since the obtained values for the variance of scores
in the first two conditions would make the interpretation
of DEFT output relatively meaningless, we decided to modify
the assumption of uniform input distributions. Based on
our familiarity with training devices in general, and with
U.S. Army training devices in particular, we hypothesized
distributions for each input parameter. The truncated normal
input distributions for Conditions 4 and 5 were the
following:
VARIABLE                                   MODE    RANGE
PD (Performance Deficit)                     70    30-90
D (Difficulty)                               55    10-100
AE (R) (Training Efficiency)                 65    25-100
RPD (Residual Performance Deficit)           30    1-65
RD (RLD) (Residual Learning Difficulty)      50    10-90
PS (Physical Similarity)                     80    30-100
FS (Functional Similarity)                   70    45-100
RR (TT) (Transfer Efficiency)                35    10-90

These distributions were obtained by transforming a standard
normal distribution centered at zero and truncated at
-3 and +3. The mode of the standard normal distribution
(always zero) was mapped to the mode of the target range, and
the truncated value of -3 was mapped to the endpoint furthest
below the mode (e.g., for a mode of 70 and a range of 30-90,
-3 was mapped to 30); finally, the target distribution was
truncated appropriately at the other end of the range.
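
One way to read this construction is sketched below. Two points are assumptions rather than statements from the report: the scale is set so that three standard units reach whichever endpoint lies farthest from the mode (this reproduces the worked example, where -3 maps to 30 for a mode of 70 and a range of 30-90), and truncation is done by discarding and redrawing out-of-range values rather than by clipping. The function name is hypothetical.

```python
import random

def sample_truncated_normal(mode, low, high, rng):
    """Draw one value from a normal centered on `mode`, truncated to [low, high]."""
    # Three standard units span the distance to the endpoint farthest from the mode.
    scale = max(mode - low, high - mode) / 3.0
    while True:
        z = rng.gauss(0.0, 1.0)
        if abs(z) > 3.0:
            continue  # standard normal truncated at -3 and +3
        x = mode + scale * z
        if low <= x <= high:
            return x  # inside the target range; otherwise redraw

rng = random.Random(1)
# Performance Deficit: mode 70, range 30-90 (from the list of input distributions).
pd_values = [sample_truncated_normal(70.0, 30.0, 90.0, rng) for _ in range(2000)]
pd_mean = sum(pd_values) / len(pd_values)
print(min(pd_values), pd_mean, max(pd_values))
```

Because the nearer endpoint cuts off part of one tail, the sample mean falls slightly on the far side of the mode.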

Results for Conditions 4 and 5 are summarized in
Tables 4 and 5. Variances are substantially lower for both
of these conditions than for Conditions 1 through 3,
because of the changes in the assumptions about input
distributions; and variance is lower for Condition 5 than
Condition 4 on account of the square root transformation.


Table 5. CONDITION 5 RESULTS--TRUNCATED NORMAL
INPUT; SQUARE ROOT TRANSFORMATION OF DENOMINATOR

DESCRIPTIVE STATISTICS FOR MODEL DEFT (RESTRICTED RANGES)
5000 TRIALS

VARIABLE             MEAN     VARIANCE      STD DEV      MINIMUM      MAXIMUM
PD            [illegible]       133.16        11.54        30.32  [illegible]
D                   51.81       223.19  [illegible]        11.02  [illegible]
R (AE)              65.01       171.62        13.10        26.81       100.00
RPD           [illegible]  [illegible]        11.29         1.10        61.71
RD (RLD)            50.10       177.50        13.32        10.32        99.60
PS                  76.11       187.78        13.70  [illegible]  [illegible]
FS                  70.17        91.65  [illegible]        15.12  [illegible]
RR (TT)       [illegible]  [illegible]        15.17        10.01  [illegible]
TP            [illegible]       111.59        12.02         5.85        81.07
ACQ (A)             17.01       253.81        15.83         6.83       122.90
AD                  10.23       127.95        11.31          .00        52.86
TRP                 25.13       176.13        13.28          .83        78.05
TRANS (T)           11.07       671.29        25.97         1.10  [illegible]
TOTAL (A+T)         91.11       896.99        29.95        18.51  [illegible]

Thus, based on some reasonable assumptions regarding
the distribution of expected input values, we see that DEFT
outputs are interpretable and meaningful in both an absolute
and a relative sense. For example, a device receiving
a Training Problem (TP) score of 65.0 could be interpreted
as addressing a "larger" problem than a typical
device (mean = 37.33, s.d. = 12.02, Condition 4).
Differences between ratings for two devices on obtained
scores could be interpreted with reference to expected
scores.
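
The relative interpretation in this example amounts to expressing an obtained score in standard deviation units of the simulated distribution. A small sketch, using only the mean and standard deviation quoted above (the function name is hypothetical):

```python
def standard_score(score, mean, std_dev):
    """Distance of an obtained score from the expected score, in standard deviations."""
    return (score - mean) / std_dev

# Training Problem score of 65.0 against the quoted mean of 37.33 and s.d. of 12.02.
z_tp = standard_score(65.0, 37.33, 12.02)
print(round(z_tp, 2))
```

A TP score of 65.0 lies a little more than two standard deviations above the expected value, which is the sense in which the device addresses a "larger" problem than a typical one.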

Sensitivity 

Eight sensitivity analyses were performed, one for
each of the DEFT input parameters. The objective of these
analyses was to explore the impact of changes in input
parameter values on the values of intermediate and output
variables.

The analyses were conducted using Condition 3 of DEFT
(as described above)--all input variables are assumed to be
distributed uniformly between one and 100; training and
transfer efficiency variables are subjected to square root
transformations.


Table  6  shows  DEFT  results  when  all  inputs  vary 
freely;  Tables  7  through  14  show  how  these  results  vary 
with  systematic  variation  of  each  input  parameter. 

As might have been expected, the efficiency variables
have the largest effect on the means and standard deviations
of the output scores. For example, across the range
of input values, changing training efficiency scores
produces variations in the Total Score mean from 334.0 to
103.5, and changes the standard deviation from 140.0 to
96.0. In general, varying each of the other inputs changes
the Total Score by approximately 100 points and the standard
deviation by approximately 40 points.

Another way of looking at these results is to say that
all scales (except Efficiency) have equivalent effects on
the Total Score: an extreme value on any single scale will
have the same effect as an extreme value on any other.
Hence, all scales are "weighted" equally. The logical (and
analytic) exceptions are the efficiency scales: a device
that incorporates poor training or transfer principles
would be expected to have a larger effect on training time,
expense, and effort than any single component, since poor
techniques will affect all aspects of the training and/or
transfer problem.
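
The stepping procedure behind these sensitivity analyses can be sketched as follows. Since the DEFT combination rules are not reproduced in this report, `toy_total` is a hypothetical three-input stand-in (NOT the DEFT formulae) that keeps only the square-root efficiency denominator of Condition 3; the step values are also illustrative.

```python
import math
import random
import statistics

def toy_total(pd, d, r):
    """Hypothetical stand-in for the DEFT combination rules (NOT the real formulae)."""
    return (pd * d / 100.0) / math.sqrt(r / 100.0)

def sensitivity(param, steps, n_trials=5000, seed=0):
    """Hold one input at each step value while all other inputs vary uniformly on 1-100."""
    rng = random.Random(seed)
    results = {}
    for value in steps:
        totals = []
        for _ in range(n_trials):
            inputs = {name: rng.uniform(1.0, 100.0) for name in ("pd", "d", "r")}
            inputs[param] = value  # the parameter under study is fixed, not random
            totals.append(toy_total(**inputs))
        results[value] = (statistics.mean(totals), statistics.stdev(totals))
    return results

# Step the efficiency rating through its range.
by_r = sensitivity("r", steps=(1, 25, 50, 75, 100))
for value, (mean, std_dev) in by_r.items():
    print(value, round(mean, 1), round(std_dev, 1))
```

In this toy setting, as in the report, the efficiency input moves the output mean and spread far more at the low end of its range than at the high end.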


Table 6. DESCRIPTIVE STATISTICS FOR DEFT--
FOR COMPARISON WITH SENSITIVITY ANALYSES

DESCRIPTIVE STATISTICS FOR MODEL DEFT SENSITIVITY ANALYSIS
5000 TRIALS

NAME             MEAN    VARIANCE    ST DEV   MINIMUM    MAXIMUM
PD              50.73      822.50     28.68      1.00     100.00
D               50.09      832.99     28.86      1.00     100.00
R (AE)          51.04      829.88     28.81      1.00     100.00
RPD             50.25      829.97     28.81      1.00     100.00
RD (RLD)        50.53      826.55     28.75      1.00     100.00
PS              50.17      829.67     28.80      1.00     100.00
FS              50.59      846.28     29.09      1.00     100.00
RR (TT)         50.45      834.94     28.90      1.00     100.00
TP              25.53      498.47     22.33       .03      99.00
ACQ (A)         47.08     3654.26     60.45       .03     837.20
AD              16.79      559.71     23.66       .00      98.00
TRP             42.00     1035.41     32.18       .02     179.14
TRANS (T)       79.74     9694.16     98.46       .03    1211.60
TOTAL (A+T)    126.83    13410.99    115.81      1.16    1266.40
Table 7. SENSITIVITY ANALYSIS FOR PD


Table  9.  SENSITIVITY  ANALYSIS  FOR  R  (AE) 


Table  10.  SENSITIVITY  ANALYSIS  FOR  RPD  (PD') 


Table  12.  SENSITIVITY  ANALYSIS  FOR  PS 
Comparison of Outputs


The objective of the "comparison of outputs" analysis
is to determine the probability of any given level of
difference between two DEFT TOTAL scores. To this end, two
DEFT TOTAL output vectors (5000 data points each) were
generated, and one was subtracted from the other to obtain
a frequency distribution of differences. Table 15
summarizes the three distributions.

It  should  be  noted  that  the  two  TOTAL  distributions 
were  generated  using  Condition  3  above,  which  assumes 
uniformly  distributed  inputs;  as  was  noted  before,  this  is 
an  extremely  conservative  assumption. 

Figure 1 shows a frequency distribution of the
differences; as is to be expected, the differences are
distributed approximately normally with a mean very close to
zero.

Table 16 summarizes the probability distribution based
on this analysis. This table can be used to determine
statistical significance, although it is extremely
conservative due to the underlying distributional assumptions.
According to this table, two devices would need to differ
by approximately 150 points in the Total Score to be judged

Table 15. DESCRIPTIVE STATISTICS FOR DEFT CONDITION 3
DIFFERENCE ANALYSIS

DESCRIPTIVE STATISTICS FOR MODEL DEFT DIFFERENCE ANALYSIS
5000 TRIALS

NAME             MEAN    VARIANCE    ST DEV    MINIMUM    MAXIMUM
TOTAL1         126.52    12869.87    113.45        .87    1335.32
TOTAL2         126.51    12320.40    111.00       1.56    1163.96
DIFFER            .01    24404.78    156.22   -1118.48    1222.42
[Plot: distribution of difference for model DEFT]

Figure 1. FREQUENCY DISTRIBUTION OF DEFT CONDITION 3 DIFFERENCES
as "different" at the 0.10 probability level. Much more
realistic is a difference based on the restricted ranges
generated in Conditions 4 and 5, described earlier. In
these cases, for example, a difference of 30 points in the
Total Score (Condition 5) would make two devices a standard
deviation apart.
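
The difference analysis can be sketched as below: two independent vectors of simulated totals are subtracted, and the size of difference exceeded by only a chosen fraction of pairs is read off the empirical distribution. `toy_total` is a hypothetical stand-in model (NOT the DEFT formulae), so the critical values it produces are illustrative and will not match Table 16.

```python
import math
import random

def toy_total(rng):
    """One simulated TOTAL from a hypothetical stand-in model (NOT the DEFT formulae)."""
    pd, d, r = (rng.uniform(1.0, 100.0) for _ in range(3))
    return (pd * d / 100.0) / math.sqrt(r / 100.0)

def simulate_totals(n_trials, seed):
    rng = random.Random(seed)
    return [toy_total(rng) for _ in range(n_trials)]

def critical_difference(first, second, alpha):
    """Absolute difference exceeded by at most a fraction `alpha` of simulated pairs."""
    abs_diffs = sorted(abs(a - b) for a, b in zip(first, second))
    index = math.ceil((1.0 - alpha) * len(abs_diffs)) - 1
    index = max(0, min(len(abs_diffs) - 1, index))
    return abs_diffs[index]

first = simulate_totals(5000, seed=1)
second = simulate_totals(5000, seed=2)
crit_10 = critical_difference(first, second, alpha=0.10)
crit_05 = critical_difference(first, second, alpha=0.05)
print(round(crit_10, 1), round(crit_05, 1))
```

Reading the critical value from the simulated differences, rather than from a fitted normal curve, avoids relying on the normality of the difference distribution.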

Stability  Analyses 

The purpose of the stability analyses was to examine
the impact of deviations from perfect reliability. It is
normally assumed that a rather high degree of stability is
necessary to demonstrate the validity of the measuring
instrument and/or the robustness of the effect being
measured. Establishing the existence of the desired degree
of stability is an empirical endeavor (e.g., through
repeated observations of raters); nonetheless, Monte Carlo
analyses can be used to hypothetically examine the potential
impact of instability.

Two kinds of Monte Carlo analyses were performed. The
scale bias analysis shows the impact of preferences for
certain portions of the input scale. The two-judge random
error analysis examines the effect of measurement error on
apparent stability. Results of this analysis can be used
for null hypothesis testing.
Impact of scale bias. Table 17 summarizes the results of
the scale bias analysis, which investigates the impact of a
rater's preference for any specific portion of the allowable
1-100 scale. Inputs are assumed to be uniformly
distributed; each row in Table 17 represents a different range
from which the values for all input variables are drawn.
The first row, provided for comparison, shows intermediate
and output variable results for the unbiased case, in which
the entire 1-100 range is used. Subsequent rows show
results for cases in which simulated judgments (i.e., input
values) are confined to smaller portions of the scale.
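
A sketch of the scale bias analysis follows. The specific sub-ranges examined in Table 17 are not recoverable from this copy of the report, so the three ranges below are hypothetical, as is the stand-in model `toy_total` (NOT the DEFT formulae).

```python
import math
import random
import statistics

def toy_total(pd, d, r):
    """Hypothetical stand-in for the DEFT combination rules (NOT the real formulae)."""
    return (pd * d / 100.0) / math.sqrt(r / 100.0)

def biased_run(low, high, n_trials=5000, seed=0):
    """Simulate a rater who confines every input rating to the sub-range [low, high]."""
    rng = random.Random(seed)
    totals = [
        toy_total(rng.uniform(low, high), rng.uniform(low, high), rng.uniform(low, high))
        for _ in range(n_trials)
    ]
    return statistics.mean(totals), statistics.stdev(totals)

# Hypothetical portions of the 1-100 scale; the first is the unbiased comparison case.
ranges = {"full scale": (1, 100), "lower half": (1, 50), "upper half": (51, 100)}
bias_table = {label: biased_run(lo, hi) for label, (lo, hi) in ranges.items()}
for label, (mean, std_dev) in bias_table.items():
    print(label, round(mean, 1), round(std_dev, 1))
```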

Two-judge random error analysis. As has already been
mentioned, Monte Carlo analysis cannot be used to determine
the degree of stability; this is an empirical question.
However, investigation can be made of the impact of
measurement error on apparent stability. In particular,
suppose that two judges are in agreement about all aspects
of a device, but, due to measurement error, their ratings
do not coincide perfectly. How does this affect their
apparent agreement?

To  investigate  this  question,  five  sets  of  simulated 
DEFT  model  output  were  generated.  The  first  set  represents 
the  "truth"  in  the  form  of  5,000  random  applications  of 


Table  17.  SCALE  BIAS  ANALYSIS  FOR  DEFT 
(UNIFORM  INPUT  DISTRIBUTIONS) 

SCALE BIAS ANALYSIS FOR MODEL DEFT
5000  TRIALS 
DEFT in which two judges in fact agree perfectly on each
and every input value. Table 18 summarizes this set of
output (generated under Condition 3). The other four sets
of DEFT output represent various kinds of "imperfection" in
the form of deviation about the "truth" values. Tables 19
through 22 summarize DEFT results for hypothetical judges
whose ratings (input values) vary randomly about the "true"
value.

In Tables 19 and 20, the random variation is uniform
over the interval true value ±5 (interval width 10); in
Tables 21 and 22, the variation is uniform over the
interval true value ±10 (interval width 20).

Table 23 summarizes distributions of difference in
DEFT TOTAL among the various data sets. The first row
(DIF10J1X) describes the variation of hypothetical Judge
1's DEFT TOTAL about "truth's" DEFT TOTAL when Judge 1 is
assumed to be reliable to ±5; the second row (DIF10J2X)
summarizes the same variation for hypothetical Judge 2.
The third row (DIF10J1J2) summarizes the distribution of
differences between Judge 1 and Judge 2's DEFT TOTALS when
the two judges are assumed to be in perfect agreement, and
each is reliable to ±5. The fourth through sixth rows
repeat the first through third rows for hypothetical judges
that are reliable to ±10 (interval width 20).
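
The two-judge simulation can be sketched as follows. Each judge's rating is the "true" value plus uniform error of ±5 or ±10; clipping the noisy rating back onto the 1-100 scale is an assumption (it is consistent with the input minima and maxima of 1.00 and 100.00 in Tables 19 through 22, but the report does not state it). `toy_total` is a hypothetical stand-in, NOT the DEFT formulae.

```python
import math
import random
import statistics

def toy_total(pd, d, r):
    """Hypothetical stand-in for the DEFT combination rules (NOT the real formulae)."""
    return (pd * d / 100.0) / math.sqrt(r / 100.0)

def judge_rating(true_value, half_width, rng):
    """True rating plus uniform error, kept on the 1-100 scale (clipping is an assumption)."""
    noisy = true_value + rng.uniform(-half_width, half_width)
    return min(100.0, max(1.0, noisy))

def simulate_judges(half_width, n_trials=5000, seed=0):
    """Mean and standard deviation of TOTAL differences: J1-truth, J2-truth, J1-J2."""
    rng = random.Random(seed)
    diffs = {"J1-truth": [], "J2-truth": [], "J1-J2": []}
    for _ in range(n_trials):
        truth = [rng.uniform(1.0, 100.0) for _ in range(3)]
        judge1 = [judge_rating(v, half_width, rng) for v in truth]
        judge2 = [judge_rating(v, half_width, rng) for v in truth]
        t, t1, t2 = toy_total(*truth), toy_total(*judge1), toy_total(*judge2)
        diffs["J1-truth"].append(t1 - t)
        diffs["J2-truth"].append(t2 - t)
        diffs["J1-J2"].append(t1 - t2)
    return {name: (statistics.mean(v), statistics.stdev(v)) for name, v in diffs.items()}

width_10 = simulate_judges(half_width=5.0)   # reliable to +/-5 (interval width 10)
width_20 = simulate_judges(half_width=10.0)  # reliable to +/-10 (interval width 20)
print(width_10["J1-J2"], width_20["J1-J2"])
```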


Table 18. HYPOTHETICAL "TRUE" RESULTS FOR DEFT

DESCRIPTIVE STATISTICS FOR MODEL DEFT INTER-RATER ANALYSIS -- TRUE VALUE
5000 TRIALS

NAME                 MEAN     VARIANCE       ST DEV      MINIMUM      MAXIMUM
PD                  51.28       841.17        29.00         1.00       100.00
D                   50.25       817.00        28.58         1.00       100.00
R (AE)              50.78       813.28        28.61         1.00       100.00
RPD                 51.64       832.52        28.85         1.00       100.00
RD (RLD)            49.77       832.99        28.86         1.00       100.00
PS                  50.54       821.20        28.66         1.00       100.00
FS                  50.24       821.77        28.67         1.00       100.00
RR (TT)             50.80       826.48        28.75         1.00       100.00
TP                  25.84       498.07  [illegible]          .03        98.00
ACQ (A)             48.23      3952.76        62.87          .04       309.90
AD                  16.75       558.34        23.63          .00        98.00
TRP                 42.50      1043.59        32.30          .06       175.4?
TRANS (T)           78.00      8163.48        90.35          .06      1187.50
TOTAL (A+T)        126.23     12143.31       110.20          .98      1240.11

Table 19. RESULTS FOR HYPOTHETICAL JUDGE 1--
DEVIATION OF ±5 FROM "TRUTH"

DESCRIPTIVE STATISTICS FOR MODEL DEFT INTER-RATER
ANALYSIS -- JUDGE #1 (INT WIDTH 10)
5000 TRIALS

NAME                 MEAN     VARIANCE       ST DEV      MINIMUM      MAXIMUM
PD            [illegible]       838.02        28.95         1.00       100.00
D                   50.31       811.60        28.49         1.00       100.00
R (AE)              50.90       809.99        28.46         1.00       100.00
RPD                 51.71       825.46        28.73         1.00       100.00
RD (RLD)            49.84       822.94        28.69         1.00       100.00
PS                  50.61       815.46        28.56         1.00       100.00
FS                  50.30       815.63        28.56         1.00       100.00
RR (TT)             50.93       822.00        28.67         1.00       100.00
TP                  25.93       496.69  [illegible]          .03        99.00
ACQ (A)             46.83      3161.89        56.23          .03       725.40
AD                  16.68       552.63  [illegible]          .00        97.00
TRP                 42.50      1035.41        32.18          .03       177.04
TRANS (T)           76.27      6686.59        81.77          .07       949.00
TOTAL (A+T)        123.10      9796.67        98.98         1.55      1038.63



Table 20. RESULTS FOR HYPOTHETICAL JUDGE 2--
DEVIATION OF ±5 FROM "TRUTH"

DESCRIPTIVE STATISTICS FOR MODEL DEFT INTER-RATER ANALYSIS --
JUDGE #2 (INT WIDTH 10)
5000 TRIALS

NAME                 MEAN     VARIANCE       ST DEV      MINIMUM      MAXIMUM
PD                  51.31       832.14        28.85         1.00       100.00
D                   50.29       812.63        28.51         1.00       100.00
R (AE)              50.80       813.37        28.52         1.00       100.00
RPD                 51.64       828.28        28.78         1.00       100.00
RD (RLD)            49.86       829.39        28.80         1.00       100.00
PS                  50.68       814.12        28.53         1.00       100.00
FS                  50.35       816.74        28.58         1.00       100.00
RR (TT)             50.87       823.44        28.70         1.00       100.00
TP                  25.90       494.84  [illegible]          .06       100.00
ACQ (A)             47.27      3221.30        56.76          .08       601.60
AD                  16.70       557.85        23.62          .00        97.00
TRP                 42.50      1039.92        32.25          .07       175.65
TRANS (T)           76.53      6848.36        82.75          .07      1052.80
TOTAL (A+T)        123.80     10010.21       100.05         1.40      1107.82

Table 21. RESULTS FOR HYPOTHETICAL JUDGE 1--
DEVIATION OF ±10 FROM "TRUTH"

DESCRIPTIVE STATISTICS FOR MODEL DEFT INTER-RATER ANALYSIS --
JUDGE #1 (INT WIDTH 20)
5000 TRIALS

NAME                 MEAN     VARIANCE       ST DEV      MINIMUM      MAXIMUM
PD                  51.33       825.24        28.73         1.00       100.00
D                   50.17       793.37        28.17         1.00       100.00
R (AE)              50.88       806.41        28.40         1.00       100.00
RPD                 51.77       820.94        28.65         1.00       100.00
RD (RLD)            49.84       818.94        28.62         1.00       100.00
PS                  50.70       808.54        28.43         1.00       100.00
FS                  50.37       807.97        28.42         1.00       100.00
RR (TT)             50.84       805.47        28.38         1.00       100.00
TP                  25.81       479.47        21.90          .01        98.01
ACQ (A)             46.15      2764.10        52.57          .02       558.00
AD                  16.63       551.34        23.48          .00        97.00
TRP                 42.49      1027.32        32.05          .02       187.05
TRANS (T)           74.97      6282.88        79.26          .03       985.60
TOTAL (A+T)        121.12      9132.09        95.56         1.65       988.53

Table 22. RESULTS FOR HYPOTHETICAL JUDGE 2--
DEVIATION OF ±10 FROM "TRUTH"

DESCRIPTIVE STATISTICS FOR MODEL DEFT INTER-RATER ANALYSIS --
JUDGE #2 (INT WIDTH 20)
5000 TRIALS

NAME                 MEAN     VARIANCE       ST DEV      MINIMUM      MAXIMUM
PD                  51.36       824.22        28.71         1.00       100.00
D                   50.34       800.38        28.29         1.00       100.00
R (AE)              50.95       801.64        28.31         1.00       100.00
RPD                 51.72       819.54        28.63         1.00       100.00
RD (RLD)            50.00       809.37        28.45         1.00       100.00
PS                  50.64       806.58        28.40         1.00       100.00
FS                  50.32       802.96        28.34         1.00       100.00
RR (TT)             50.88       806.74        28.40         1.00       100.00
TP                  25.95       491.39        22.17          .06        99.00
ACQ (A)             46.25      2773.98        52.72          .08       631.45
AD                  16.62       541.59        23.27          .00        95.00
TRP                 42.51      1009.59        31.77          .06       185.20
TRANS (T)           75.06      5967.60        77.25          .06       901.20
TOTAL (A+T)        121.31      8693.74        93.24         1.26       907.58



Table 23. DISTRIBUTIONS OF DEFT TOTAL DIFFERENCES

DESCRIPTIVE STATISTICS FOR MODEL DEFT INTER-RATER DIFFERENCES
5000 TRIALS

NAME            MEAN    VARIANCE    ST DEV    MINIMUM    MAXIMUM
DIF10J1X       -3.13     1689.11     41.10    -646.40     533.59
DIF10J2X       -2.43     1620.27     40.25    -623.61     458.34
DIF10J1J2       -.70     1633.79     40.42    -551.94     581.99
DIF20J1X       -5.11     3529.54     59.41    -755.65     586.10
DIF20J2X       -4.92     3024.65     55.00    -818.98     644.14
DIF20J1J2       -.19     3144.11     56.07    -641.92     703.01

The utility of this analysis is in its potential for
null hypothesis testing. Given two (real) judges rating
the same device, and a difference between their DEFT TOTAL
scores, we can determine the likelihood of a difference of
that magnitude or larger given stability of ±5 or ±10 and
an assumption of no underlying disagreement. Since the
differences appear to be distributed normally (see Figures
2 through 7), this test can be made using the standard
normal distribution. Output of this analysis can also be used
to determine confidence intervals or credible intervals
about the DEFT TOTAL computed from one (real) judge's input
ratings, assuming stability of ±5 or ±10.
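
As an illustration of the test described above, the following
Python sketch computes the two-sided tail probability for an
observed difference between two judges' DEFT TOTAL scores, and an
approximate interval about a single judge's TOTAL. It assumes the
differences are normal with mean zero and takes the standard
deviations from Table 23, reading interval width 10 as stability of
±5 and interval width 20 as stability of ±10. The function names,
the example scores, and the 1.96 critical value are illustrative
choices, not part of DEFT.

```python
import math

# Standard deviations of simulated DEFT TOTAL differences (Table 23),
# keyed by the assumed stability of the ratings (+/- points).
SD_J1_VS_J2 = {5: 40.42, 10: 56.07}
SD_J1_VS_TRUTH = {5: 41.10, 10: 59.41}


def two_sided_p(observed_diff, sd):
    """Chance of a difference at least this large in magnitude, assuming
    the differences are normal with mean 0 and the given SD."""
    z = abs(observed_diff) / sd
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(z / math.sqrt(2.0))


def interval_about_total(total, sd, z_crit=1.96):
    """Approximate 95% interval about one judge's DEFT TOTAL."""
    half_width = z_crit * sd
    return total - half_width, total + half_width


if __name__ == "__main__":
    # Hypothetical case: two judges differ by 90 points on DEFT TOTAL.
    print(round(two_sided_p(90.0, SD_J1_VS_J2[5]), 3))
    print(interval_about_total(120.0, SD_J1_VS_TRUTH[5]))
```

The same observed difference is less surprising under the looser
stability assumption, because the reference distribution is wider.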


[Histogram: FREQUENCY DISTRIBUTION OF JUDGE #2 - TRUE VALUE FOR INT WIDTH 10, 5000 TRIALS]

Figure 3. DISTRIBUTION OF DEFT DIFFERENCES FOR HYPOTHETICAL
JUDGE 2 (RELIABLE TO ± 5) VERSUS "TRUTH"


[Histogram: FREQUENCY DISTRIBUTION OF JUDGE #1 - JUDGE #2 FOR INT WIDTH 10, 5000 TRIALS]

Figure 4. DISTRIBUTION OF DEFT TOTAL DIFFERENCES FOR
HYPOTHETICAL JUDGE 1 VERSUS HYPOTHETICAL JUDGE 2 (PERFECT


[Histogram: FREQUENCY DISTRIBUTION OF JUDGE #1 - JUDGE #2 FOR INT WIDTH 20, 5000 TRIALS]

Figure 7. DISTRIBUTION OF DEFT TOTAL DIFFERENCES FOR
HYPOTHETICAL JUDGE 1 VERSUS HYPOTHETICAL JUDGE 2


3.  Interrater  Agreement 


The purpose of this exercise was to determine the degree
of interrater agreement that could be achieved using
DEFT. This exercise also served as a "dry run" through the
DEFT procedures; in essence, it was a "feasibility" study. Could
DEFT be used by various types of raters with more or less
familiarity with the selected training devices and more or
less familiarity with DEFT?

The method chosen was to have six raters use DEFT to
evaluate three training devices. Two of the training
devices were designed to train the same tasks and subtasks;
thus, we had a "comparative" evaluation. The third training
device was designed to train several different tasks.
We selected two of these tasks. We chose this method
(i.e., a limited set of training devices and a limited set
of raters) rather than alternative approaches (e.g., many
raters-one training device, few raters-many training
devices, many raters-many training devices) primarily
because of time and resource constraints. However, we also
viewed this method as a "worst-case" test: if we could not
demonstrate agreement in this situation, we would not be
able to demonstrate agreement in less controlled
situations. Our method also constrained the use of
sophisticated statistical evaluations. For example,
correlations between raters over repeated measures on the same
rating scale could not be meaningfully interpreted due to
the small number of observations. Nonetheless, descriptive
statistics, such as mean differences across raters, could
provide sufficient information to determine the feasibility
and usefulness of DEFT.

Method 

Devices and Tasks/Subtasks. Two armor gunnery training
devices were selected: the MK-60 Gunnery Trainer
(VIGS) and the burst-on-target (BOT) trainer. These two
devices were examined in the context of training a single
gunnery engagement, shown in Figure 8 (from Harris, Ford,
Tufano, & Wiggs, 1983). The third device selected was a
maintenance procedures simulator. This was selected
because AIR staff were intimately involved in its design,
extensive materials were available, and the tasks selected
for evaluation were similar to maintenance procedures
contained in U.S. Army tasks. Brief descriptions of the three
devices and the tasks and subtasks evaluated follow.
I DOC JOB OBJECTIVE 56
PLUS BOT


Precision, periscope, stationary firing tank, moving tank target
(1200-1600 meters), SABOT, direct fire adjustment (BOT)


GUNNER  BEHAVIORAL  ELEMENTS 

1.  Gunner  indexes  ammunition. 

2.  Gunner  turns  on  main  gun  switch. 

3.  Gunner  announces  IDENTIFIED. 

4. Gunner applies lead in direction of target apparent
motion.

5.  Gunner  lays  crosshair  leadline  at  center  of  target 
vulnerability. 

6.  Gunner  makes  final  precise  lay. 

7.  Gunner  announces  ON  THE  WAY. 

8.  Gunner  fires  main  gun. 

9.  Gunner  announces  sensing  and  BOT. 

10.  Gunner  relays  (BOT). 

11.  Gunner  announces  ON  THE  WAY  (BOT). 

12.  Gunner  fires  main  gun  (BOT). 


The  gunnery  engagement  and  gunner  behaviors  come  from  two  sources. 

1. Boldovici, J.A. (HumRRO), Boycan, G.G. (ARI), Fingerman, P.F., &
Wheaton, G.R. (AIR). M60A1AOS Tank Gunnery Data Handbook. ARI
Technical Report TR-79-A7, March 1979.

2.  U.S.  Army,  FM17-12,  Tank  Gunnery,  March  1977. 


FIGURE  8.  GUNNERY  ENGAGEMENT 
The gunnery engagement (Figure 8) was selected for
several reasons: first, AIR staff were familiar
with it; second, excellent documentation was available; and
third, this engagement had previously been processed
through earlier versions of the TRAINVICE models (see
Harris et al., 1983).

The  MPS  Trainer.  Materials  drawn  from  AIR/Bedford 
files  for  the  period  1974-1983  were  extracted  and  edited  to 
describe  the  E-3A  Navigation  Computer  System  (NCS)  and  the 
Maintenance  Procedure  Simulator  (MPS)  for  that  system.  The 
MPS  was  built  by  Honeywell  to  E-3A  design  specifications 
developed  by  AIR/Bedford. 

The MPS was designed and acquired to support training
in organizational (flightline) maintenance procedures for
the AN/ASN-118 NCS installed on the E-3A aircraft. The NCS
supplies navigation data to the aircraft flight control
system, the flight crew, and the radar data processing
group. The NCS incorporates a pair of redundant
CAROUSEL-type inertial navigation units, a single doppler
system to measure altitude, and an Omega VLF
receiver/computer system to measure aircraft position. Organizational
maintenance of the NCS relies primarily upon automatic
fault detection and isolation performed by built-in test
equipment (BITE). Isolated faults are corrected by removal
and replacement of line-replaceable units (LRUs) or
substitutions of faulty soldered components (inductors,
capacitors, filters).

The MPS is a computer-controlled trainer housed in a
single integrated console. Operation of the E-3A aircraft
AN/ASN-118 Navigation Computer System (NCS) is simulated
only to the extent needed to perform the required
organization-level maintenance procedures for the NCS.
Faults in the NCS are simulated through the action of
computer software. Required maintenance actions such as
removal and replacement, connect and disconnect, and
inspection are simulated by the use of MPS controls rather
than actual operations.

During a normal training situation, the student
operates controls of simulated aircraft and support
equipment contained in the MPS. The computer software
repetitively samples MPS control settings and causes the
appropriate response to be displayed. Software response to
the instructor/student actions can cause one or more of the
following to occur:


1)  change  to  one  or  more  indicator  displays 

2)  removal  or  change  of  35-mm  slide  displays 

3)  Teleprinter  message 

The  MPS  provides  273  training  exercises  that  are  used 
to  train  entering  E-3A  maintenance  technicians  on  the  NCS 
system-specific  operations  and  maintenance  procedures. 
Students  entering  the  training  course  have  completed  basic 
training  and  a  general  navigation  course  which  leads  to  the 
award  of  semi-skilled  (3-level)  rating  in  AFSC  328X4.  Upon 
graduation,  students  proceed  to  the  E-3A  Wing  at  Tinker 
AFB,  where  they  begin  work  on  the  flightline.  They  are  un¬ 
der  supervision  and  receive  additional  on-the-job  training. 

Table  24  describes  two  "tasks"  which  are,  in  reality, 
two  parts  of  one  of  the  273  exercises.  The  tasks  selected 
for description are: (1) Checkout of the Inertial
Navigation System (INS), and (2) Fault isolation of Fault
10 (of 100). Two information packages were prepared. The
first  set  represented  each  task  as  performed  in  conjunction 
with  the  operational  equipment.  The  second  set  represented 
the  same  tasks  as  performed  in  conjunction  with  MPS.  Both 
provide  data  formulated  for  direct  entry  into  the 
computerized  DEFT  program.  The  data  included  descriptive 


Table 24. MPS and E-3A Tasks and Subtasks


Task 1: Checkout of Inertial Navigation System (INS)

Subtask
Number    Subtask Description

10        Ensure E-3A aircraft power and cooling is available
20        Turn NCS Power on
30        Turn Autopilot off
40        Turn (2) probe heaters off
50        Synchronize (2) Horizontal Situation Indicators (HSI)
60        Set INS-1 and INS-2 to align mode
70        Test CDU displays and lamps
80        Detect Fault 10. (Performance index does not decrease
          from 9 to 5)

Task 2: Fault Isolation of Fault 10

81        Interchange CDU-1 and CDU-2 (Simulated on MPS)
82        Perform Checkout (Task 1: 10-80)
83        Interchange INU-1 and INU-2 (Simulated on MPS)
84        Perform Checkout (Task 1: 10-80)
85        Check 115 VAC Power
86        Check wiring continuity (resistance)
87        Replace shorted capacitor (Simulated on MPS)
88        Perform Checkout (Task 1: 10-80)


text for each subtask and the controls, displays, skills,
and knowledge associated with the subtask. Task 1 was
detailed only to the level required to link Task 1 and
Task 2. The details of the subtasks were greatly
abbreviated to reduce or eliminate redundancy of activities
which are required by the actual procedures, both for the
operational equipment and for the trainer. Photographs and
accompanying text were provided to indicate location of
equipment; a listing of the associated displays and
controls was also provided.

Raters.  Six  AIR  staff  members  participated  in  this 
study.  These  raters  had  differing  degrees  of  familiarity 
with  each  of  the  training  devices,  tasks,  and  DEFT  itself: 

Raters 1 and 2: Very familiar with DEFT, BOT;
                familiar with VIGS; unfamiliar with MPS

Raters 3 and 4: Unfamiliar with DEFT, BOT, and VIGS;
                very familiar with MPS

Raters 5 and 6: Familiar with DEFT, BOT, and VIGS;
                unfamiliar with MPS


We  planned  to  examine  the  impact  of  these  differences 
on  the  various  DEFT  ratings  and  outputs. 
Procedure. Packages of materials were prepared for
each training device. The packages varied in the quality
and quantity of information provided. The BOT package
consisted of a picture of the device, a brief
engineering description, and the list of tasks and subtasks
involved. The VIGS package was the actual device user's
manual, complete with pictures, instructions for use, and
capabilities of the device. The MPS package contained
scores of pictures, descriptions, engineering
specifications, extracts from the Technical Manual used by actual
crewmen on the E-3A aircraft, and the user's manual for
MPS.

Following the distribution of these packages to each
of the raters, Raters 1-5 met to discuss the packages and
to receive instruction on how to use DEFT. It was decided
that the sparse information available regarding the BOT
device would be inadequate for the purposes of this study.
(Although in a "real-world" application, training device
evaluators might be faced with similar problems, i.e., a
lack of detailed information, our primary purpose was to
determine interrater agreement. If each rater supplied his
own set of assumptions regarding, e.g., training
proficiency standards, differences in ratings could not be
attributed to disagreements regarding DEFT.) Thus, the
raters were briefed as to the details of BOT, both as
performed on the training devices (BOT and VIGS) and as
performed on the M60 tank. In addition, raters were
briefed in detail on the E-3A and MPS configurations for
the tasks under investigation.

DEFT was presented and discussed at the "mechanical"
level; that is, raters were told how to operate the
computer and how to proceed through the DEFT analyses. There
was no discussion as to the meaning or interpretation of
the various judgments and scales; we hoped that the
information provided on the screen would be sufficient.

Following this meeting, each rater was given a DEFT
program diskette and a data diskette containing the
necessary data bases. Each rater then processed each of the
three training devices through all three levels of DEFT.
Raters analyzed BOT first, VIGS second, and MPS third,
completing all DEFT analyses on each device before analyzing
the next device.

At the completion of these analyses, the data
diskettes were collected and the raw data scanned. A cursory
examination of these data revealed that the information
contained on the DEFT screens and the briefings held prior to
the analyses were inadequate. Examination of the notes
each rater kept regarding his ratings indicated that each
was operating under a different set of assumptions. These
differences ranged from data entry conventions (e.g., if a
Training Principle in the Acquisition Efficiency analyses
of DEFT III was judged to be "not applicable," some raters
entered "0," others entered "100," and others entered
"999") to different assumptions regarding trainee
characteristics (e.g., some raters thought the trainees for the
MPS device were skilled maintenance crewmen, others
thought that they were naive crewmen, and still others thought
that they were naive graduates of a Technical School with
no aircraft experience). Thus, it was decided to reconvene
the raters to discuss the devices and clarify assumptions.
Following these discussions, changes in ratings were
re-entered by the individual raters. Because of logistic
constraints, Raters 5 and 6 could not attend this meeting;
therefore, their results were not included in further
analyses.

Results 

Output indexes. At each level of DEFT, seven output
indexes are computed for a training device evaluation
(although different numbers and types of ratings are involved
in the different DEFT levels). These seven are:


1)  Training  Problem  (TP) 

2)  Acquisition  Efficiency  (AE) 

3)  Acquisition  (A);  computed  as  TP/AE 

4)  Transfer  Problem  (TRP) 

5)  Transfer  Efficiency  (TE) 

6) Transfer (T); computed as TRP/TE

7)  Total  Score;  computed  as  A  +  T 
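
The relations among the seven indexes can be sketched in a few
lines of Python. This is only an illustration of the arithmetic
listed above: the function name and the example values are
hypothetical, and the Efficiency indexes are assumed to be
expressed as proportions greater than zero.

```python
def deft_summary(tp, ae, trp, te):
    """Return the seven DEFT output indexes from the four rated components.

    ae and te are assumed to be proportions in (0, 1]; a zero value
    would leave the corresponding ratio undefined.
    """
    if ae <= 0 or te <= 0:
        raise ValueError("Efficiency indexes must be greater than zero")
    acquisition = tp / ae      # A = TP / AE
    transfer = trp / te        # T = TRP / TE
    return {
        "TP": tp, "AE": ae, "A": acquisition,
        "TRP": trp, "TE": te, "T": transfer,
        "TOTAL": acquisition + transfer,   # Total Score = A + T
    }


# Hypothetical component values, chosen only to show the calculation.
scores = deft_summary(tp=25.0, ae=0.5, trp=21.0, te=0.25)
print(scores["A"], scores["T"], scores["TOTAL"])
```

Because both summary indexes are ratios, a low Efficiency value
inflates the corresponding summary score.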

Theoretically, these indexes should be equivalent across
all three levels of DEFT for a particular training device
evaluation, since the successively more detailed levels of
DEFT are designed to be componential assessments of more
global judgments. Thus, the first question we will examine
is whether raters were "internally consistent": For each
index on each training device, do the scores for the
different levels of DEFT agree?

Relevant data are shown in Tables 25 - 27. Table 25
shows obtained indexes for each rater on the BOT device for
all levels of DEFT; Table 26 shows the same information for
the VIGS device; and Table 27 shows the same information
for the MPS device. Note that these data were obtained
after the second meeting of the raters, where assumptions
involved and interpretations of the scales were discussed.


TABLE  26.  DEFT  INDEX  VALUES:  VIGS 


The logical question to ask first is what an
"acceptable" level of internal consistency would be. How close
to one another should we desire that these indexes be?
This is an arbitrary decision; however, considering the
results of the Monte Carlo analyses discussed in previous
sections, it is clear that the data shown in these tables
for DEFT I and DEFT III are internally consistent. Of the
84 pairs (3 devices x 4 raters x 7 indexes) of DEFT I and
DEFT III indexes, 70 (83.3%) are within 20 points of each
other, and about half are within 10 points of each other.
Furthermore, most of the large disagreements are due to
arithmetic combinations of smaller disagreements. For
example, consider Rater 2, BOT:


                 DEFT I     DEFT III

TRP               21.0        33.1
TE                0.25        0.21
T                 84.0       157.0
Total Score      122.8       192.4

The relatively small difference in TRP is magnified by the
very small difference in TE to produce large differences in
T and Total Score. This also may have been anticipated
from the Monte Carlo sensitivity analyses: small
differences in the Efficiency indexes will have large
effects on summary indexes. If these cumulative
differences are taken into account, it appears that DEFT I
and DEFT III indexes are internally consistent.
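
The Rater 2 example can be worked through directly. The sketch
below recomputes T = TRP/TE from the tabled TRP and TE values and
then changes one input at a time to show how much each contributes
to the gap. The function name is illustrative. Recomputing T for
DEFT III from the rounded inputs gives a value slightly above the
tabled 157.0, presumably because TE was rounded for display (an
assumption).

```python
def transfer(trp, te):
    """Transfer index: Transfer Problem divided by Transfer Efficiency."""
    return trp / te


# Rater 2, BOT (TRP and TE as tabled above).
deft1 = {"TRP": 21.0, "TE": 0.25}
deft3 = {"TRP": 33.1, "TE": 0.21}

t1 = transfer(deft1["TRP"], deft1["TE"])         # DEFT I
t3 = transfer(deft3["TRP"], deft3["TE"])         # DEFT III, rounded inputs
trp_only = transfer(deft3["TRP"], deft1["TE"])   # change TRP, keep DEFT I TE
te_only = transfer(deft1["TRP"], deft3["TE"])    # change TE, keep DEFT I TRP

for label, value in [("DEFT I", t1), ("DEFT III", t3),
                     ("TRP changed only", trp_only),
                     ("TE changed only", te_only)]:
    print(f"{label:18s} {value:6.1f}")
```

A change of 0.04 in TE alone moves T by about 16 points, and the
effect compounds when TRP differs as well.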

On the other hand, DEFT II indexes are substantially
higher than either DEFT I or DEFT III in practically all
cases. A closer examination of the data reveals that the
problems seem to be with the TP and TRP indexes (the
Training and Transfer Problems, respectively). Each is
approximately twice as large for DEFT II as for the others.

This anomaly can be explained by examining how these
indexes are derived for DEFT II as compared to DEFT I and
DEFT III. In both of the latter cases, TP and TRP are
multiplicative functions of two ratings: Performance Deficit
and Performance Difficulty. Thus, in DEFT I, if a training
device objective is judged to contain 50% skills and
knowledge not possessed by trainees, and these skills and
knowledge are judged to be moderately difficult to learn
(e.g., they are rated at "50" on the Performance Difficulty
scale), the TP score will be (50 x 50)/100 = 25. However,
in DEFT II, the judgment made as to the Performance Deficit
is a simple "yes" or "no" (can do or can't do) for each
task contained in the training objective. Thus, the
multiplicative combination of deficit and difficulty is not
contained in DEFT II. In fact, when the DEFT II indexes
are modified by incorporating either DEFT I or DEFT III
Performance Deficit ratings, the DEFT II indexes dovetail
precisely with the other indexes. (These recalculated
indexes are not shown.)
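
The difference between the two derivations can be illustrated
with the example values from the text. In this sketch the
DEFT I / DEFT III form multiplies a percentage deficit by a
difficulty rating, while the DEFT II form is assumed to treat a
"can't do" judgment as a full (100%) deficit. That reading, and
both function names, are assumptions made for illustration; they
are not taken from the DEFT program itself.

```python
def training_problem_graded(deficit_pct, difficulty):
    """DEFT I / DEFT III style: deficit (0-100) times difficulty (0-100),
    divided by 100."""
    return deficit_pct * difficulty / 100.0


def training_problem_binary(can_do, difficulty):
    """Assumed DEFT II style: a yes/no deficit, so "can't do" counts in full."""
    return 0.0 if can_do else float(difficulty)


graded = training_problem_graded(50, 50)      # worked example from the text
binary = training_problem_binary(False, 50)   # same task judged "can't do"
print(graded, binary, binary / graded)
```

Under this reading the binary form yields twice the graded value,
in line with the roughly twofold difference reported for TP and
TRP.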

The other relatively minor inconsistencies in these
data are in the Efficiency indexes (AE and TE) of DEFT III.
In most cases (19 out of 24), the DEFT III Efficiency
indexes are the lowest of the three (although in most cases
these differences are quite small). In post-rating
discussions, the raters felt that this was partially due to an
"oversegmentation" problem: many of the eleven Training
Efficiency and eight Transfer Efficiency principles
received quite low ratings when applied to subtasks. For
example, augmenting feedback for a relatively trivial
subtask such as "Indexes ammunition" would quite reasonably
not be included as an instructional feature of the VIGS
device; nevertheless, VIGS was "penalized" with a low
Efficiency rating for this principle.

Part of this problem is a terminological artifact of
the particular tasks and subtasks selected for this study.
While we termed "Indexes ammunition" a subtask, in standard
task analyses it would probably be considered a "step" or a
"behavioral element." The resolution of the Efficiency
index problem will involve either "tightening up" DEFT
input requirements (e.g., by specifying task-analytic
procedures and definitions for determining "tasks" and
"subtasks") or conducting DEFT III Efficiency analyses
at the task level.

The next question that can be addressed by the
examination of these data is interrater agreement within and
across devices for these indexes. Thus, for example, do
raters agree on the TP value for VIGS? Again, the question
as to what would constitute "agreement" must be arbitrarily
answered. Standard correlational techniques are not
meaningfully interpretable with small sample sizes. Thus,
we will examine interrater agreement descriptively.

When one closely examines Tables 25 - 27, one can only
be impressed by the equivalence of the indexes across
raters for all three training devices. With the exception
of the Total Score and an occasional "deviant" point, all
indexes are within a few points of one another. Considering
the range of values that these indexes can take and the
expected magnitude of difference scores as demonstrated by
the Monte Carlo analyses, this correspondence is excellent.
If the 100-point scales were converted to discrete 5- or
7-point scales, interrater agreement would be almost
perfect.

Again, we must note that these data were obtained
following a discussion among the raters; this discussion
undoubtedly pulled the ratings closer together. (Countering
this, however, is that discussions were of the rating
scales, not of the summary indexes.) The picture of
interrater agreement prior to the discussion, while still quite
good, was not quite so rosy. As was mentioned previously,
differing interpretations and rating conventions
(particularly with respect to scoring rules for the Efficiency
scales) resulted in many index values that were not
comparable. For example, when a Training Principle was judged as
"not applicable," some raters scored the scale as "zero,"
others as "100," and others as "999." Clearly, it would
not make sense to compare indexes derived for these
different raters.

The major discrepancy in these comparisons is the
disagreement in the Total Scores. Paralleling the above
discussions, we attribute these differences to the cumulative
effects of smaller differences in individual component
indexes; furthermore, many of the Total Score differences can
be traced to the large impacts of the Efficiency indexes.




One possible solution, as suggested by the Monte Carlo
analyses, is to transform the Efficiency indexes (e.g., by
using a square root). While this reduces the problem, it
does not eliminate it; however, this manipulation, plus the
adoption of the suggestion to conduct DEFT III Efficiency
analyses at the task (rather than the subtask) level, would
produce substantial convergence in Total Scores.
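
The effect of a square-root transformation can be seen with the
Rater 2, BOT values quoted earlier (TRP of 21.0 and 33.1; TE of
0.25 and 0.21). The sketch assumes the transformation is applied
to the Efficiency index before the division; the exact form is
not specified here, so this is one plausible reading, and the
function name is illustrative.

```python
import math


def transfer(trp, te, transform=None):
    """T = TRP / TE, optionally transforming TE first (e.g., math.sqrt)."""
    te_used = transform(te) if transform else te
    return trp / te_used


pairs = [(21.0, 0.25), (33.1, 0.21)]   # (TRP, TE) for DEFT I and DEFT III

raw = [transfer(trp, te) for trp, te in pairs]
damped = [transfer(trp, te, math.sqrt) for trp, te in pairs]

raw_gap = raw[1] - raw[0]
damped_gap = damped[1] - damped[0]
print("untransformed gap:", round(raw_gap, 1))
print("square-root gap:  ", round(damped_gap, 1))
```

The gap narrows but does not vanish, consistent with the
observation that the transformation reduces the problem without
eliminating it.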

In  summary,  these  data  indicate  substantial  interrater 
agreement  for  all  DEFT  indexes  and  across  the  three 
devices.  This  is  even  more  encouraging  when  one  considers 
first  that  the  raters  had  different  degrees  of  familiarity 
with  DEFT  and  the  three  devices,  and  second  that  the  three 
devices  were  of  quite  different  sorts.  The  next  issue  to 
examine  is  whether  these  levels  of  interrater  agreements 
are  maintained  when  the  individual  scales  are  examined. 

Individual  scales.  Table  28  shows  the  average  pairwise 
agreements  among  the  four  raters  for  each  of  the  eight 
DEFT  I  scales.  These  figures  were  computed  by  taking  the 
absolute  differences  between  each  pair  of  raters  on  each 
scale  judgment,  adding  them,  and  calculating  a  mean  and 
standard  deviation.  Since  all  raters  rated  all  dimensions, 
there  were  six  differences  that  were  combined  for  each 
entry  in  the  table.  In  addition,  row  and  column  means  of 
these  mean  differences  are  shown. 
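
The computation behind each table entry can be sketched as
follows. Four ratings yield six rater pairs, and the mean and
standard deviation are taken over the six absolute differences.
The ratings in the example are invented, and the use of the
population (divide-by-n) standard deviation is an assumption,
since the form used is not stated.

```python
from itertools import combinations
from statistics import mean, pstdev


def pairwise_agreement(ratings):
    """Mean and SD of absolute differences over all pairs of raters."""
    diffs = [abs(a - b) for a, b in combinations(ratings, 2)]
    return mean(diffs), pstdev(diffs)


# Hypothetical ratings from four raters on one 100-point scale.
ratings = [50.0, 55.0, 55.0, 60.0]
n_pairs = len(list(combinations(ratings, 2)))
avg_diff, sd_diff = pairwise_agreement(ratings)
print(n_pairs, avg_diff, round(sd_diff, 2))
```

Smaller means indicate closer agreement; a mean of zero arises
only when all four raters give the same rating.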
TABLE 28. MEANS AND STANDARD DEVIATIONS OF PAIRED RATER COMPARISONS
FOR EACH TRAINING DEVICE - DEFT I

(Standard deviations in parentheses. The device labels for the
three rows are not legible in the source.)

                                     Question
Device     PD      LD      TA      RD      RLD     PS      FS      TT      Mean

Row 1     11.67    0.0     8.17    5.00    5.00    5.33    9.17    9.17    6.69
          (6.88)  (0.0)   (4.95)  (2.89)  (2.89)  (2.63)  (5.34)  (5.34)  (1.80)

Row 2     12.17    5.83   14.17    5.00    9.17   10.00   12.50   14.17   10.38
          (7.06)  (3.44)  (6.72)  (5.00)  (5.34)  (7.07)  (9.47)  (6.72)  (3.02)

Row 3      9.17   10.00    5.83    5.00   12.83   13.33   10.00   11.67    9.73
          (5.34)  (5.77)  (3.44)  (5.00)  (8.15)  (6.24)  (7.07)  (6.87)  (2.58)

Mean      11.0     5.28    9.39    5.00    9.00    9.56   10.56   11.67

GRAND MEAN  8.93
As could be surmised from the discussion above
concerning the output indexes, interrater agreement for each
of the underlying scales was also quite substantial.
Overall, the average disagreement was approximately 9
points (on a hundred-point scale), well within what could
be considered acceptable levels of agreement. For the
individual scales, the average disagreement was between 5.0
and 11.67 points, with no particular scale having an
unusually high level of disagreement. Likewise, the three
devices all showed equivalent levels of agreement.

Tables 29 and 30 show the equivalent data for DEFT II
and DEFT III. Again, with minor discrepancies, interrater
agreement was high for all scales for the DEFT models on
all three devices. The conclusions to draw from these
tables are the same as were made above for the summary
indexes: Interrater agreement for DEFT is encouragingly high,
especially given differences among raters with respect to
familiarity with DEFT and the three devices; and the level
of interrater agreement demonstrated would support the
continued development and use of DEFT for the evaluation of
training-device-based training systems.
TABLE 29. MEANS AND STANDARD DEVIATIONS OF PAIRED RATER COMPARISONS
FOR EACH TRAINING DEVICE - DEFT II

(Standard deviations in parentheses.)

                                    Question
Device           PD      LD      RD      RLD      PS       FS       Mean

BOT   Task 1     0.0    10.83    0.0    11.67    10.83     7.50     6.81
                (0.0)   (5.34)  (0.0)   (6.87)   (5.34)   (4.79)

      Task 2     0.0     5.0     0.0     5.0      5.0     13.33     4.72
                (0.0)   (5.0)   (0.0)   (2.89)   (5.0)    (9.43)

E3A   Task 1     0.0     8.33    0.0     2.50    10.0     10.0      5.14
                (0.0)   (5.53)  (0.0)   (2.50)   (7.07)   (5.77)

      Task 2     0.0    12.50    0.0    10.0     11.67    18.33     8.75
                (0.0)   (7.50)  (0.0)   (7.07)   (6.87)  (10.67)

VIGS  Task 1     0.0     9.33    0.0    15.0     11.67    16.67     8.78
                (0.0)   (5.31)  (0.0)   (8.66)   (6.87)   (7.45)

      Task 2     0.0    10.17    0.0    10.17    12.50    15.00     7.97
                (0.0)   (5.87)  (0.0)   (5.27)   (7.50)   (7.64)

Mean             0.0     9.36    0.0     9.06    10.28    13.47     7.03
                (0.0)   (2.56)  (0.0)   (4.55)   (2.72)   (4.10)


              Acquisition    Transfer
Device        Efficiency     Efficiency     Mean

BOT              2.50           5.17        3.84
                (2.5)          (2.99)

E3A              3.50           6.83        5.06
                (2.18)         (4.76)

VIGS             7.08           8.83        7.95
                (3.36)         (5.15)

TABLE 30. MEANS AND STANDARD DEVIATIONS* OF PAIRED RATER COMPARISONS
FOR EACH TRAINING DEVICE - DEFT III


Question 


Burst on Target
Task  1 

PD 

LD 

TA 

RD 

RLD 

TT 

Index 

Ammunition 

X 

a 

0.00 

(0.00) 

0.00 

(0.00) 

24.64 

(12.89) 

- 

0.00 

23.75 

Turn  on  Main 

Gun  Switch 

7 

a 

0.50 

(0.50) 

0.00 

(0.00) 

26.36 
(  9.59) 

- 

- 

- 

Announce 

Identified 

X 

a 

0.00 

(0.00) 

0.00 

(0.00) 

22.44 

(12.04) 

- 

- 

- 

Apply  Lead 

X 

0.67 

0.17 

11.80 

0.00 

0.33 

13.92 

(Simulated) 

a 

(0.47) 

(0.17) 

(  9.11) 

(0.24) 

(  6.30) 

Lay  Crosshair 

X 

0.00 

0.00 

11.44 

_ 

0.00 

16.94 

Leadline

a 

(0.00) 

(0.00) 

(  9.34) 

(0.00) 

(  9.62) 

Fire  Main 

Gun 

X 

a 

0.67 

(0.47) 

0.00 

53.46 

- 

- 

- 

Task  2 


Sense  x 

Round  a 

0.00 

(0.00) 

0.00 

(0.00) 

8.05 

4.84 

0.00  0.33 

(0.24) 

12.21 
(  6.87) 

Announce  7 

Sensing  &  "BOT"  a 

0.67 

(0.47) 

0.00 

(0.00) 

9.73 
(  7.03) 

0.00 

(0.00) 

18.67 
(  6.60) 

Relay  to  New  7 
Aiming  Point  o 

1.00 

(0.58) 

0.00 

(0.00) 

6.99 

(4.46) 

0.00 

(0.00) 

6.44 
(  3.82) 

Fire  Main  x̄

Gun  σ

0.67 

(0.47) 

0.00 

48.91 

- 

- 

E3A 

Task  1 

Ensure  Power  &  x̄
Cooling  Avail.  σ

1.33 

(0.94) 

0.00 

(0.00) 

28.11 

(13.90) 

- 

- 

Turn  on  NCS  x̄

Power  on  σ

1.83 

(1.07) 

0.00 

(0.00) 

28.23 

(13.77) 

0.17 

(0.10) 

24.94 

(11.68) 

*Standard deviations are provided when more than two raters supplied
a rating.


Table  30  (Continued) 


PD 

LD 

TA 

RD

RLD 

TT 

Turn  Autopilot 
Off 

x̄

σ

2.33

(1.37)

0.00

(0.00)

31.55

- 

0.17 

26.25 

Turn  Probe 
Heaters  Off 

x̄

σ

2.33

(1.37)

0.00

(0.00)

31.55

- 

0.17 
No.  D,G 

27.50 
No.  D,G 

Synchronize  x̄

Horizontal  σ

Situation  Indicators

1.83
(1.07)

0.00 

(0.00) 

4.18 

(1.52) 

- 

0.11 

(0.08) 

22.50 
(  8.81) 

INS-1  &  INS-2 
to  Align  Mode 

x̄

σ

1.83
(1.07)

0.00 

(0.00) 

14.97 
(  6.24) 

- 

0.11 

(0.00) 

25.00 

(11.90) 

Test  UDC 

Display  &  Lamps 

x̄

σ

1.83
(1.07)

0.17 

(0.17) 

14.85 
(  5.87) 

- 

0.11 
(0.08) 
No.  G 

24.17 
(  9.80) 
No.  G 

Detect 

Fault  10 

x̄

σ

2.17 

(1.57) 

0.00 

(0.00) 

37.33 

(13.56) 

- 

- 

- 

Task  2 

CDUs 

x̄

σ

2.00

(1.41)

0.00 

(0.00) 

24.97 

(12.23) 

- 

0.11 

(0.08) 

15.08 
(  7.07) 

Sim.  Restart, 
Perform 

Checkout 

x̄

σ

2.50

(1.50)

0.00 

39.73 

- 

0.33 

(0.24) 

19.90 

(16.07) 

INUs 

x̄

σ

2.00

(1.41)

0.00 

(0.00) 

24.97 

(12.23) 

- 

0.00 

(0.00) 

15.08 
(  7.07) 

Sim.  Restart,
Perform

Checkout

x̄

σ

2.50 

(1.50) 

0.00 

48.73 

- 

0.36 

(0.26) 

19.90 

(16.07) 

Check  115  VAC 
Power 

x̄

σ

1.50 

(1.50) 

0.17 

(0.17) 

22.70 

(11.59) 

- 

0.78 

(0.44) 

9.25 
(  4.81) 

Sim.  Continuity
Check,  Check
Wiring  Continuity

x̄

σ

2.00

(1.41)

0.22

(0.16)

27.64

(10.48) 

- 

0.33 

(0.14) 

13.00 
(  5.70) 

Sim.  Replace, 
of  Capacitor, 

x̄

σ

0.50 

(0.50) 

0.00 

(0.00) 

29.27 

(10.72) 

- 

0.00 

19.50 

Replace  Shorted  Capacitor


Table  30  (Continued)

PD 

LD 

TA 

RD 

RLD 

TT 

Sim.  Restart, 
Perform 

Checkout 

VIGS 

Task  1 

x̄

σ

2.50 

(1.50) 

0.00 

48.73 

0.33 

(0.24) 

25.17 

(17.80) 

Index 

Ammunition

x̄

σ

0.50 

(0.50) 

- 

- 

- 

- 

- 

Turn  on  Main 

Gun  Switch 

x̄

σ

0.50 

(0.50) 

- 

- 

- 

- 

- 

Announce 

IDENTIFIED 

x̄

σ

0.67 

(0.47) 

0.00 

27.36 

- 

- 

- 

Apply 

x̄

0.00 

0.00 

13.65 

_ 

0.33 

10.63 

Lead 

σ

(0.00) 

(0.00) 

(  8.64) 

(0.24) 

(  4.83) 

Lay  Crosshair 

x̄

0.50 

0.00 

11.77 

- 

0.00 

6.98 

Leadline

σ

(0.50) 

(0.00) 

(  5.41) 

(0.00) 

(4.72) 

Fire  Main 

Gun 

x̄

σ

0.50 

(0.50) 

- 

- 

- 

- 

- 

Task  2 

Sense 

x̄

0.67 

0.00 

18.76 

- 

0.00 

10.46 

Round 

σ

(0.47)

(0.00) 

(10.28) 

(0.00) 

(  5.02) 

Announce 

x̄

0.00 

0.00 

24.86 

— 

0.00 

14.67 

Sensing  &  "BOT" 

σ

(0.00) 

(0.00) 

(11.59) 

(0.00) 

(  5.35) 

Relay  to  New 

x̄

1.17 

0.22 

18.44 

0.00 

12.08 

Aiming  Point 

σ

(0.69) 

(0.16) 

(10.70) 

(0.00) 

(  5.76) 

Fire  Main 

x̄

0.50 

_ 

- 

- 

- 

- 
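
The entries in Table 30 summarize how far apart pairs of raters were on each question. A minimal sketch of one way such a summary could be computed follows. It assumes that each paired comparison is the absolute difference between two raters' ratings and that the standard deviation is the population form (dividing by the number of pairs); the report does not state either point here. The function name and the ratings are hypothetical and are not taken from the DEFT data.

```python
from itertools import combinations
from statistics import mean, pstdev


def paired_rater_summary(ratings):
    """Return (mean, SD) of absolute differences over all rater pairs.

    Assumption: a "paired comparison" is |r_i - r_j| for raters i and j.
    The SD is None when only two raters supplied a rating (a single pair),
    following the note to Table 30.
    """
    diffs = [abs(a - b) for a, b in combinations(ratings, 2)]
    if not diffs:
        raise ValueError("at least two ratings are required")
    sd = pstdev(diffs) if len(ratings) > 2 else None
    return mean(diffs), sd


# Hypothetical ratings from three raters on one question.
avg, sd = paired_rater_summary([3, 4, 4])
print(round(avg, 2), round(sd, 2))  # prints 0.67 0.47
```

Two raters who agree and a third who differs by one scale point give a mean difference of 0.67 with a standard deviation of 0.47. The same pair of values appears in several cells of Table 30, which is consistent with this reading but does not confirm it.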

Summary 


Based on the analyses presented in this report, a number
of recommendations can be made regarding modifications
of DEFT:

1. The expected distribution of summary index scores
is too wide to support meaningful interpretation of
DEFT output unless assumptions are made regarding
the expected distributions of input variables in the real
world. All of the assumptions we made are defensible
(e.g., a training device will not be built that addresses
no performance deficit); however, a different set of
assumptions would result in different critical values for
inter-device comparisons.

2. The major contributors to output variance are the
two Efficiency scales. To reduce this problem, it is
recommended that some transform (e.g., square root) be
applied to these scales.

3. It is recommended that two additional scales be
added to the DEFT II analyses. These scales would assess
the proportion of the skills and knowledge required by
the training device requirement, and by the operational
performance objective, that the trainees do not possess.


4.  It  is  imperative  that  when  more  than  one  rater 
applies  DEFT  to  the  evaluation  of  a  device,  the  raters 
agree  on  their  assumptions  regarding  the  device,  trainee 
population,  device  utilization,  and  the  meanings  of  the 
various  DEFT  scales  prior  to  conducting  analyses. 

Based on these results, recommendations 2 and 3 above
have been implemented in the most recent DEFT programs.
Presumably, the remaining recommendations would be
implemented by DEFT users.
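
The effect sought by recommendation 2 can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that an Efficiency rating is uniformly distributed on a 0 to 100 scale; the report does not state this, and the scale range, sample size, seed, and function names are hypothetical. The sketch compares the relative spread (standard deviation divided by the mean) of the raw ratings with that of their square roots.

```python
import math
import random
from statistics import mean, pstdev


def relative_spread(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def simulate(n=10_000, seed=1):
    """Draw hypothetical Efficiency ratings; return raw and square-root spread."""
    rng = random.Random(seed)
    raw = [rng.uniform(0.0, 100.0) for _ in range(n)]  # assumed 0-100 scale
    transformed = [math.sqrt(v) for v in raw]
    return relative_spread(raw), relative_spread(transformed)


cv_raw, cv_sqrt = simulate()
print(f"raw: {cv_raw:.2f}  square root: {cv_sqrt:.2f}")
```

Under the uniform assumption the theoretical values are about 0.58 for the raw scale and about 0.35 after the square-root transform, so the transformed scale varies less relative to its mean. How DEFT actually combines its scales is not modeled here.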


